Hardware acceleration components for translating guest instructions to native instructions

ABSTRACT

A hardware based translation accelerator. The hardware includes a guest fetch logic component for accessing guest instructions; a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling guest instructions into a guest instruction block; and conversion tables coupled to the guest fetch buffer for translating the guest instruction block into a corresponding native conversion block. The hardware further includes a native cache coupled to the conversion tables for storing the corresponding native conversion block, and a conversion look aside buffer coupled to the native cache for storing a mapping of the guest instruction block to corresponding native conversion block, wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest instruction has a corresponding converted native instruction in the native cache.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 13/360,034, filed Jan. 27, 2012, entitled “HARDWARE ACCELERATION COMPONENTS FOR TRANSLATING GUEST INSTRUCTIONS TO NATIVE INSTRUCTIONS,” naming Mohammad Abdallah as inventor, and having attorney docket number SMII-0033.US, which is herein incorporated by reference in its entirety, and which claims the benefit of co-pending commonly assigned U.S. Provisional Patent Application Ser. No. 61/436,966, titled “HARDWARE ACCELERATION COMPONENTS FOR TRANSLATING GUEST INSTRUCTIONS TO NATIVE INSTRUCTIONS” by Mohammad A. Abdallah, filed on Jan. 27, 2011, and which is incorporated herein in its entirety.

FIELD OF THE INVENTION

The present invention is generally related to digital computer systems, more particularly, to a system and method for translating instructions comprising an instruction sequence.

BACKGROUND OF THE INVENTION

Many types of digital computer systems utilize code transformation/translation or emulation to implement software-based functionality. Generally, translation and emulation both involve examining a program of software instructions and performing the functions and actions dictated by the software instructions, even though the instructions are not “native” to the computer system. In the case of translation, the non-native instructions are translated into a form of native instructions which are designed to execute on the hardware of the computer system. Examples include prior art translation software and/or hardware that operates with industry standard x86 applications to enable the applications to execute on non-x86 or alternative computer architectures. Generally, a translation process utilizes a large number of processor cycles, and thus, imposes a substantial amount of overhead. The performance penalty imposed by the overhead can substantially erode any benefits provided by the translation process.

One attempt at solving this problem involves the use of just-in-time compilation. Just-in-time (JIT) compilation, also known as dynamic translation, is a method to improve the runtime performance of computer programs. Traditionally, computer programs had two modes of runtime transformation, either interpretation mode or JIT compilation/translation mode. Interpretation is a decoding process that involves decoding instruction by instruction to transform the code from guest to native with lower overhead than JIT compilation, but it produces transformed code that performs less well. Additionally, the interpretation is invoked with every instruction. JIT compilers or translators represent a contrasting approach to interpretation. JIT conversion usually has a higher overhead than interpretation, but it produces translated code that is more optimized and has higher execution performance. In most emulation implementations, the first time a translation is needed, it is done as an interpretation to reduce overhead; after the code has been seen (executed) many times, a JIT translation is invoked to create a more optimized translation.

However, the code transformation process still presents a number of problems. The JIT compilation process itself imposes a significant amount of overhead on the processor. This can cause a large delay in the start up of the application. Additionally, managing the storage of transformed code in system memory causes multiple trips back and forth to system memory and includes memory mapping and allocation management overhead, which imposes a significant latency penalty. Furthermore, changes to the region of execution in the application involve relocating the transformed code in the system memory and code cache, and starting the process from scratch. The interpretation process involves less overhead than JIT translation, but its overhead is repeated per instruction and thus is still relatively significant. The code produced is poorly optimized, if at all.

SUMMARY OF THE INVENTION

Embodiments of the present invention implement an algorithm and an apparatus that enables a hardware based acceleration of a guest instruction to native instruction translation process.

In one embodiment, the present invention is implemented as a hardware based translation accelerator. The hardware based translation accelerator includes a guest fetch logic component for accessing a plurality of guest instructions; a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling the plurality of guest instructions into a guest instruction block; and a plurality of conversion tables coupled to the guest fetch buffer for translating the guest instruction block into a corresponding native conversion block.

The hardware based translation accelerator further includes a native cache coupled to the conversion tables for storing the corresponding native conversion block, and a conversion look aside buffer coupled to the native cache for storing a mapping of the guest instruction block to the corresponding native conversion block, wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest instruction has a corresponding converted native instruction in the native cache. In response to the hit, the conversion look aside buffer forwards the translated native instruction for execution.

The foregoing is a summary and thus contains, by necessity, simplifications, generalizations and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements.

FIG. 1 shows an exemplary sequence of instructions operated on by one embodiment of the present invention.

FIG. 2 shows a diagram depicting a block-based translation process where guest instruction blocks are converted to native conversion blocks in accordance with one embodiment of the present invention.

FIG. 3 shows a diagram illustrating the manner in which each instruction of a guest instruction block is converted to a corresponding native instruction of a native conversion block in accordance with one embodiment of the present invention.

FIG. 4 shows a diagram illustrating the manner in which far branches are processed with handling of native conversion blocks in accordance with one embodiment of the present invention.

FIG. 5 shows a diagram of an exemplary hardware accelerated conversion system illustrating the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache in accordance with one embodiment of the present invention.

FIG. 6 shows a more detailed example of a hardware accelerated conversion system in accordance with one embodiment of the present invention.

FIG. 7 shows an example of a hardware accelerated conversion system having a secondary software-based accelerated conversion pipeline in accordance with one embodiment of the present invention.

FIG. 8 shows an exemplary flow diagram illustrating the manner in which the CLB functions in conjunction with the code cache and the guest instruction to native instruction mappings stored within memory in accordance with one embodiment of the present invention.

FIG. 9 shows an exemplary flow diagram illustrating a physical storage stack code cache implementation and the guest instruction to native instruction mappings in accordance with one embodiment of the present invention.

FIG. 10 shows a diagram depicting additional exemplary details of a hardware accelerated conversion system in accordance with one embodiment of the present invention.

FIG. 11A shows a diagram of an exemplary pattern matching process implemented by embodiments of the present invention.

FIG. 11B shows a diagram of a SIMD register-based pattern matching process in accordance with one embodiment of the present invention.

FIG. 12 shows a diagram of a unified register file in accordance with one embodiment of the present invention.

FIG. 13 shows a diagram of a unified shadow register file and pipeline architecture 1300 that supports speculative architectural states and transient architectural states in accordance with one embodiment of the present invention.

FIG. 14 shows a diagram of the second usage model, including dual scope usage in accordance with one embodiment of the present invention.

FIG. 15 shows a diagram of the third usage model, including transient context switching without the need to save and restore a prior context upon returning from the transient context in accordance with one embodiment of the present invention.

FIG. 16 shows a diagram depicting a case where the exception in the instruction sequence is because translation for subsequent code is needed in accordance with one embodiment of the present invention.

FIG. 17 shows a diagram of the fourth usage model, including transient context switching without the need to save and restore a prior context upon returning from the transient context in accordance with one embodiment of the present invention.

FIG. 18 shows a diagram of an exemplary microprocessor pipeline in accordance with one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Although the present invention has been described in connection with one embodiment, the invention is not intended to be limited to the specific forms set forth herein. On the contrary, it is intended to cover such alternatives, modifications, and equivalents as can be reasonably included within the scope of the invention as defined by the appended claims.

In the following detailed description, numerous specific details such as specific method orders, structures, elements, and connections have been set forth. It is to be understood however that these and other specific details need not be utilized to practice embodiments of the present invention. In other circumstances, well-known structures, elements, or connections have been omitted, or have not been described in particular detail in order to avoid unnecessarily obscuring this description.

References within the specification to “one embodiment” or “an embodiment” are intended to indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearances of the phrase “in one embodiment” in various places within the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not other embodiments.

Some portions of the detailed descriptions, which follow, are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals of a computer readable storage medium and are capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as “processing” or “accessing” or “writing” or “storing” or “replicating” or the like, refer to the action and processes of a computer system, or similar electronic computing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories and other computer readable media into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Embodiments of the present invention function by greatly accelerating the process of translating guest instructions from a guest instruction architecture into native instructions of a native instruction architecture for execution on a native processor. Embodiments of the present invention utilize hardware-based units to implement hardware acceleration for the conversion process. The guest instructions can be from a number of different instruction architectures. Example architectures include Java or JavaScript, x86, MIPS, SPARC, and the like. These guest instructions are rapidly converted into native instructions and pipelined to the native processor hardware for rapid execution. This provides a much higher level of performance in comparison to traditional software controlled conversion processes.

In one embodiment, the present invention implements a flexible conversion process that can use as inputs a number of different instruction architectures. In such an embodiment, the front end of the processor is implemented such that it can be software controlled, while taking advantage of hardware accelerated conversion processing to deliver the much higher level of performance. Such an implementation delivers benefits on multiple fronts. Different guest architectures can be processed and converted while each receives the benefits of the hardware acceleration to enjoy a much higher level of performance. The software controlled front end can provide a great degree of flexibility for applications executing on the processor. The hardware acceleration can achieve near native hardware speed for execution of the guest instructions of a guest application. In the descriptions which follow, FIG. 1 through FIG. 4 show the manner in which embodiments of the present invention handle guest instruction sequences and handle near branches and far branches within those guest instruction sequences. FIG. 5 shows an overview of an exemplary hardware accelerated conversion processing system in accordance with one embodiment of the present invention.

FIG. 1 shows an exemplary sequence of instructions operated on by one embodiment of the present invention. As depicted in FIG. 1, the instruction sequence 100 comprises 16 instructions, proceeding from the top of FIG. 1 to the bottom. As can be seen in FIG. 1, the sequence 100 includes four branch instructions 101-104.

One objective of embodiments of the present invention is to process entire groups of instructions as a single atomic unit. This atomic unit is referred to as a block. A block of instructions can extend well past the 16 instructions shown in FIG. 1. In one embodiment, a block will include enough instructions to fill a fixed size (e.g., 64 bytes, 128 bytes, 256 bytes, or the like), or until an exit condition is encountered. In one embodiment, the exit condition for concluding a block of instructions is the encounter of a far branch instruction. As used herein in the descriptions of embodiments, a far branch refers to a branch instruction whose target address resides outside the current block of instructions. In other words, within a given guest instruction block, a far branch has a target that resides in some other block or in some other sequence of instructions outside the given instruction block. Similarly, a near branch refers to a branch instruction whose target address resides inside the current block of instructions. Additionally, it should be noted that a native instruction block can contain multiple guest far branches. These terms are further described in the discussions which follow below.
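By way of a non-limiting illustration, the near branch and far branch definitions above can be sketched in software as follows. The names GuestInstr and classify_branch, and the convention that the block occupies a half-open address range, are assumptions made only for this sketch and do not appear in the described hardware.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GuestInstr:
    """Hypothetical model of one guest instruction."""
    addr: int                      # guest address of the instruction
    size: int                      # encoded length in bytes
    target: Optional[int] = None   # branch target address; None for non-branches


def classify_branch(instr: GuestInstr, block_start: int, block_end: int) -> Optional[str]:
    """Classify a branch relative to the block [block_start, block_end).

    Returns "near" if the target lies inside the current block, "far" if it
    lies outside, and None if the instruction is not a branch.
    """
    if instr.target is None:
        return None
    if block_start <= instr.target < block_end:
        return "near"
    return "far"


if __name__ == "__main__":
    # A 64-byte block starting at guest address 0x100.
    start, end = 0x100, 0x140
    print(classify_branch(GuestInstr(0x104, 2, target=0x120), start, end))  # inside the block
    print(classify_branch(GuestInstr(0x108, 2, target=0x400), start, end))  # outside the block
```

In this sketch only a far result would act as an exit condition for the block; a near result leaves the block boundary unchanged.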

FIG. 2 shows a diagram depicting a block-based conversion process, where guest instruction blocks are converted to native conversion blocks in accordance with one embodiment of the present invention. As illustrated in FIG. 2, a plurality of guest instruction blocks 201 are shown being converted to a corresponding plurality of native conversion blocks 202.

Embodiments of the present invention function by converting instructions of a guest instruction block into corresponding instructions of a native conversion block. Each of the blocks 201 is made up of guest instructions. As described above, these guest instructions can be from a number of different guest instruction architectures (e.g., Java or JavaScript, x86, MIPS, SPARC, etc.). Multiple guest instruction blocks can be converted into one or more corresponding native conversion blocks. This conversion occurs on a per instruction basis.

FIG. 2 also illustrates the manner in which guest instruction blocks are assembled into sequences based upon a branch prediction. This attribute enables embodiments of the present invention to assemble sequences of guest instructions based upon the predicted outcomes of far branches. Based upon far branch prediction, a sequence of guest instructions is assembled from multiple guest instruction blocks and converted to a corresponding native conversion block. This aspect is further described in FIG. 3 and FIG. 4 below.

FIG. 3 shows a diagram illustrating the manner in which each instruction of a guest instruction block is converted to a corresponding native instruction of a native conversion block in accordance with one embodiment of the present invention. As illustrated in FIG. 3, the guest instruction blocks reside within a guest instruction buffer 301. Similarly, the native conversion block(s) reside within a native instruction buffer 302.

FIG. 3 shows an attribute of embodiments of the present invention, where the target addresses of the guest branch instructions are converted to target addresses of the native branch instructions. For example, the guest instruction branches each include an offset that identifies the target address of the particular branch. This is shown in FIG. 3 as the guest offset, or G_offset. As guest instructions are converted, this offset is often different because of the different lengths or sequences required by the native instructions to produce the functionality of the corresponding guest instructions. For example, the guest instructions may be of different lengths in comparison to their corresponding native instructions. Hence, the conversion process compensates for this difference by computing the corresponding native offset. This is shown in FIG. 3 as the native offset, or N_offset.
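By way of a non-limiting illustration, the computation of N_offset from G_offset can be sketched as follows. The sketch assumes that each guest instruction converts to one contiguous native sequence of known length, and that both offsets are measured in bytes from the start of the branch instruction to the start of the target instruction. These conventions, and the names start_offsets and native_offset, are assumptions of the sketch and are not specified by the description above.

```python
from itertools import accumulate
from typing import List


def start_offsets(lengths: List[int]) -> List[int]:
    """Byte offset at which each instruction starts within its buffer."""
    return [0] + list(accumulate(lengths))[:-1]


def native_offset(guest_lens: List[int], native_lens: List[int],
                  branch_idx: int, g_offset: int) -> int:
    """Convert a guest branch offset (G_offset) into a native offset (N_offset).

    guest_lens[i] is the byte length of guest instruction i, and
    native_lens[i] is the byte length of the native code it converts to.
    """
    g_starts = start_offsets(guest_lens)
    n_starts = start_offsets(native_lens)
    target_guest = g_starts[branch_idx] + g_offset
    # Raises ValueError if the target does not land on an instruction boundary.
    target_idx = g_starts.index(target_guest)
    return n_starts[target_idx] - n_starts[branch_idx]


if __name__ == "__main__":
    guest_lens = [2, 3, 1, 4]     # variable-length guest encodings
    native_lens = [4, 4, 8, 4]    # lengths of the converted native sequences
    print(native_offset(guest_lens, native_lens, branch_idx=0, g_offset=6))
    print(native_offset(guest_lens, native_lens, branch_idx=3, g_offset=-4))
```

The two offsets differ whenever any instruction between the branch and its target changes length during conversion, which is why the native offset must be recomputed rather than copied.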

It should be noted that the branches that have targets within a guest instruction block, referred to as near branches, are not predicted, and therefore do not alter the flow of the instruction sequence.

FIG. 4 shows a diagram illustrating the manner in which far branches are processed with handling of native conversion blocks in accordance with one embodiment of the present invention. As illustrated in FIG. 4, the guest instructions are depicted as a guest instruction sequence in memory 401. Similarly, the native instructions are depicted as a native instruction sequence in memory 402.

In one embodiment, every instruction block, both guest instruction blocks and native instruction blocks, concludes with a far branch (e.g., even though native blocks can contain multiple guest far branches). As described above, a block will include enough instructions to fill a fixed size (e.g., 64 bytes, 128 bytes, 256 bytes, or the like) or until an exit condition, such as, for example, the last guest far branch instruction, is encountered. If a number of guest instructions have been processed to assemble a guest instruction block and a far branch has not been encountered, then a guest far branch is inserted to conclude the block. This far branch is merely a jump to the next subsequent block. This ensures that instruction blocks conclude with a branch that leads to either another native instruction block, or another sequence of guest instructions in memory. Additionally, as shown in FIG. 4, a block can include a guest far branch within its sequence of instructions that does not reside at the end of the block. This is shown by the guest instruction far branch 411 and the corresponding native instruction guest far branch 412.
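By way of a non-limiting illustration, the rule that a block fills a fixed size and always concludes with a far branch can be sketched as follows. The names Instr and conclude_block are hypothetical. The sketch further assumes that the inserted jump is appended whenever the last accumulated instruction is not a far branch, and that the inserted jump is not counted against the size budget. These are simplifications chosen for the sketch only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    """Hypothetical model of one guest instruction in a block."""
    addr: int
    size: int
    far_target: Optional[int] = None  # set only when the instruction is a far branch
    inserted: bool = False            # True for the synthetic concluding jump


def conclude_block(instrs: List[Instr], max_bytes: int = 64) -> List[Instr]:
    """Accumulate instructions up to max_bytes and ensure a concluding far branch."""
    block: List[Instr] = []
    used = 0
    for ins in instrs:
        if used + ins.size > max_bytes:
            break
        block.append(ins)
        used += ins.size
    if not block:
        return block
    last = block[-1]
    if last.far_target is None:
        # No far branch concludes the block: insert a jump to the next
        # subsequent block (size 0 here is an assumption of this sketch).
        next_addr = last.addr + last.size
        block.append(Instr(addr=next_addr, size=0, far_target=next_addr, inserted=True))
    return block


if __name__ == "__main__":
    straight_line = [Instr(addr=4 * i, size=4) for i in range(20)]
    block = conclude_block(straight_line, max_bytes=64)
    print(len(block), hex(block[-1].far_target), block[-1].inserted)
```

A far branch in the middle of the block, such as far branch 411, is left in place by this sketch, and only the final position is checked.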

In the FIG. 4 embodiment, the far branch 411 is predicted taken. Thus the instruction sequence jumps to the target of the far branch 411, which is the guest instruction F. Similarly, in the corresponding native instructions, a far branch 412 is followed by the native instruction F. The near branches are not predicted. Thus, they do not alter the instruction sequence in the same manner as far branches.

In this manner, embodiments of the present invention generate a trace of conversion blocks, where each block comprises a number (e.g., 3-4) of far branches. This trace is based on guest far branch predictions.

In one embodiment, the far branches within the native conversion block include a guest address that is the opposite address for the opposing branch path. As described above, a sequence of instructions is generated based upon the prediction of far branches. The true outcome of the prediction will not be known until the corresponding native conversion block is executed. Thus, once a false prediction is detected, the false far branch is examined to obtain the opposite guest address for the opposing branch path. The conversion process then continues from the opposite guest address, which is now the true branch path. In this manner, embodiments of the present invention use the included opposite guest address for the opposing branch path to recover from occasions where the predicted outcome of a far branch is false. Hence, if a far branch predicted outcome is false, the process knows where to go to find the correct guest instruction. Similarly, if the far branch predicted outcome is true, the opposite guest address is ignored. It should be noted that if far branches within a native instruction block are predicted correctly, no entry point in the CLB for their target blocks is needed. However, once a misprediction occurs, a new entry for the target block needs to be inserted in the CLB. This function is performed with the goal of preserving CLB capacity.
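By way of a non-limiting illustration, the recovery from a false far branch prediction can be sketched as follows. The names NativeFarBranch and resolve_far_branch are hypothetical, and the CLB is modeled as a plain dictionary from guest address to native address, with None standing for a block whose conversion is still pending. Both choices are assumptions of the sketch and not a description of the hardware.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NativeFarBranch:
    """Hypothetical model of a far branch stored in a native conversion block."""
    predicted_taken: bool
    opposite_guest_addr: int  # guest address of the path that was NOT predicted


def resolve_far_branch(branch: NativeFarBranch, actually_taken: bool,
                       clb: Dict[int, Optional[int]]) -> Optional[int]:
    """Return the guest address where conversion must resume, or None.

    On a correct prediction the opposite guest address is ignored and no CLB
    entry is added. On a false prediction the opposite guest address becomes
    the true path and a CLB entry is reserved for its target block.
    """
    if actually_taken == branch.predicted_taken:
        return None
    resume_addr = branch.opposite_guest_addr
    clb.setdefault(resume_addr, None)  # native address filled in after conversion
    return resume_addr


if __name__ == "__main__":
    clb: Dict[int, Optional[int]] = {}
    branch = NativeFarBranch(predicted_taken=True, opposite_guest_addr=0x2040)
    print(resolve_far_branch(branch, actually_taken=True, clb=clb), len(clb))
    print(hex(resolve_far_branch(branch, actually_taken=False, clb=clb)), len(clb))
```

Adding entries only on a false prediction is what keeps correctly predicted targets out of the table in this sketch, consistent with the stated goal of preserving CLB capacity.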

FIG. 5 shows a diagram of an exemplary hardware accelerated conversion system 500 illustrating the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache in accordance with one embodiment of the present invention. As illustrated in FIG. 5, a conversion look aside buffer 506 is used to cache the address mappings between guest and native blocks, such that the most frequently encountered native conversion blocks are accessed through low latency availability to the processor 508.

The FIG. 5 diagram illustrates the manner in which frequently encountered native conversion blocks are maintained within a high-speed low latency cache, the conversion look aside buffer 506. The components depicted in FIG. 5 implement hardware accelerated conversion processing to deliver the much higher level of performance.

The guest fetch logic unit 502 functions as a hardware-based guest instruction fetch unit that fetches guest instructions from the system memory 501. Guest instructions of a given application reside within system memory 501. Upon initiation of a program, the hardware-based guest fetch logic unit 502 starts prefetching guest instructions into a guest fetch buffer 503. The guest fetch buffer 503 accumulates the guest instructions and assembles them into guest instruction blocks. These guest instruction blocks are converted to corresponding native conversion blocks by using the conversion tables 504. The converted native instructions are accumulated within the native conversion buffer 505 until the native conversion block is complete. The native conversion block is then transferred to the native cache 507 and the mappings are stored in the conversion look aside buffer 506. The native cache 507 is then used to feed native instructions to the processor 508 for execution. In one embodiment, the functionality implemented by the guest fetch logic unit 502 is produced by a guest fetch logic state machine.

As this process continues, the conversion look aside buffer 506 is filled with address mappings of guest blocks to native blocks. The conversion look aside buffer 506 uses one or more algorithms (e.g., least recently used, etc.) to ensure that block mappings that are encountered more frequently are kept within the buffer, while block mappings that are rarely encountered are evicted from the buffer. In this manner, hot native conversion block mappings are stored within the conversion look aside buffer 506. In addition, it should be noted that the well predicted far guest branches within the native block do not need to insert new mappings in the CLB because their target blocks are stitched within a single mapped native block, thus preserving a small capacity efficiency for the CLB structure. Furthermore, in one embodiment, the CLB is structured to store only the ending guest to native address mappings. This aspect also preserves the small capacity efficiency of the CLB.

The guest fetch logic 502 looks to the conversion look aside buffer 506 to determine whether addresses from a guest instruction block have already been converted to a native conversion block. As described above, embodiments of the present invention provide hardware acceleration for conversion processing. Hence, the guest fetch logic 502 will look to the conversion look aside buffer 506 for pre-existing native conversion block mappings prior to fetching a guest address from system memory 501 for a new conversion.

In one embodiment, the conversion look aside buffer is indexed by guest address ranges, or by individual guest address. The guest address ranges are the ranges of addresses of guest instruction blocks that have been converted to native conversion blocks. The native conversion block mappings stored by a conversion look aside buffer are indexed via their corresponding guest address range of the corresponding guest instruction block. Hence, the guest fetch logic can compare a guest address with the guest address ranges or the individual guest address of converted blocks, the mappings of which are kept in the conversion look aside buffer 506, to determine whether a pre-existing native conversion block resides within what is stored in the native cache 507 or in the code cache of FIG. 6. If the pre-existing native conversion block is in either of the native cache or in the code cache, the corresponding native conversion instructions are forwarded from those caches directly to the processor.
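By way of a non-limiting illustration, a conversion look aside buffer indexed by guest address range with a least recently used replacement policy can be modeled in software as follows. The class name, the capacity value, and the use of half-open ranges are assumptions of the sketch. A hardware structure would compare all entries in parallel rather than scanning them in order.

```python
from collections import OrderedDict
from typing import Optional, Tuple


class ConversionLookAsideBuffer:
    """Software model of a CLB keyed by guest address range (hypothetical)."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        # (guest_start, guest_end) -> native address; least recently used first.
        self.entries: "OrderedDict[Tuple[int, int], int]" = OrderedDict()

    def insert(self, guest_start: int, guest_end: int, native_addr: int) -> None:
        """Record a mapping for the guest block [guest_start, guest_end)."""
        key = (guest_start, guest_end)
        self.entries[key] = native_addr
        self.entries.move_to_end(key)         # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

    def lookup(self, guest_addr: int) -> Optional[int]:
        """Return the native address on a hit, or None on a miss."""
        hit = next((k for k in self.entries if k[0] <= guest_addr < k[1]), None)
        if hit is None:
            return None
        self.entries.move_to_end(hit)         # a hit refreshes recency
        return self.entries[hit]


if __name__ == "__main__":
    clb = ConversionLookAsideBuffer(capacity=2)
    clb.insert(0x100, 0x140, 0x9000)
    clb.insert(0x200, 0x240, 0x9100)
    print(hex(clb.lookup(0x120)))      # hit inside the first range
    clb.insert(0x300, 0x340, 0x9200)   # evicts the range starting at 0x200
    print(clb.lookup(0x210))           # miss after eviction
```

Because the lookup at 0x120 refreshed the first mapping, the later insertion evicts the mapping that was touched least recently, which is the behavior the replacement policy is intended to provide for hot blocks.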

In this manner, hot guest instruction blocks (e.g., guest instruction blocks that are frequently executed) have their corresponding hot native conversion block mappings maintained within the high-speed low latency conversion look aside buffer 506. As blocks are touched, an appropriate replacement policy ensures that the hot block mappings remain within the conversion look aside buffer. Hence, the guest fetch logic 502 can quickly identify whether requested guest addresses have been previously converted, and can forward the previously converted native instructions directly to the native cache 507 for execution by the processor 508. These aspects save a large number of cycles, since trips to system memory can take 40 to 50 cycles or more. These attributes (e.g., CLB, guest branch sequence prediction, guest & native branch buffers, native caching of the prior) allow the hardware acceleration functionality of embodiments of the present invention to achieve application performance of a guest application to within 80% to 100% of the application performance of a comparable native application.

In one embodiment, the guest fetch logic 502 continually prefetches guest instructions for conversion independent of guest instruction requests from the processor 508. Native conversion blocks can be accumulated within a conversion buffer “code cache” in the system memory 501 for those less frequently used blocks. The conversion look aside buffer 506 also keeps the most frequently used mappings. Thus, if a requested guest address does not map to a guest address in the conversion look aside buffer, the guest fetch logic can check system memory 501 to determine if the guest address corresponds to a native conversion block stored therein.

In one embodiment, the conversion look aside buffer 506 is implemented as a cache and utilizes cache coherency protocols to maintain coherency with a much larger conversion buffer stored in higher levels of cache and system memory 501. The native instruction mappings that are stored within the conversion look aside buffer 506 are also written back to higher levels of cache and system memory 501. Write backs to system memory maintain coherency. Hence, cache management protocols can be used to ensure the hot native conversion block mappings are stored within the conversion look aside buffer 506 and the cold native conversion block mappings are stored in the system memory 501. Hence, a much larger form of the conversion look aside buffer 506 resides in system memory 501.
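By way of a non-limiting illustration, the two-level arrangement described above, in which a small conversion look aside buffer is backed by a larger conversion buffer in system memory, can be sketched as follows. Both levels are modeled as plain dictionaries, and the function name fetch_native, the convert callback, and the promotion of a memory hit into the CLB are assumptions of the sketch. Replacement and coherency protocols are omitted.

```python
from typing import Callable, Dict, Tuple


def fetch_native(guest_addr: int,
                 clb: Dict[int, int],
                 memory_map: Dict[int, int],
                 convert: Callable[[int], int]) -> Tuple[int, str]:
    """Resolve a guest address to a native address through both levels.

    Returns the native address and a label naming where it was found.
    """
    if guest_addr in clb:                  # hot mapping: fast path
        return clb[guest_addr], "clb"
    if guest_addr in memory_map:           # cold mapping kept in system memory
        native_addr = memory_map[guest_addr]
        clb[guest_addr] = native_addr      # promotion is an assumption of this sketch
        return native_addr, "memory"
    native_addr = convert(guest_addr)      # no prior conversion exists
    clb[guest_addr] = native_addr
    memory_map[guest_addr] = native_addr   # write back so both levels agree
    return native_addr, "converted"


if __name__ == "__main__":
    clb: Dict[int, int] = {}
    memory_map: Dict[int, int] = {0x200: 0x9100}

    def toy_convert(guest_addr: int) -> int:
        # Stand-in for the real conversion; places native code at a fixed displacement.
        return 0x8000 + guest_addr

    print(fetch_native(0x100, clb, memory_map, toy_convert))  # first use: converted
    print(fetch_native(0x100, clb, memory_map, toy_convert))  # second use: CLB hit
    print(fetch_native(0x200, clb, memory_map, toy_convert))  # found in system memory
```

The write back in the final branch mirrors the statement that mappings held in the conversion look aside buffer are also written to system memory.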

It should be noted that in one embodiment, the exemplary hardware accelerated conversion system 500 can be used to implement a number of different virtual storage schemes. For example, the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache can be used to support a virtual storage scheme. Similarly, a conversion look aside buffer 506 that is used to cache the address mappings between guest and native blocks can be used to support the virtual storage scheme (e.g., management of virtual to physical memory mappings).

In one embodiment, the FIG. 5 architecture implements a virtual instruction set processor/computer that uses a flexible conversion process that can receive as inputs a number of different instruction architectures. In such a virtual instruction set processor, the front end of the processor is implemented such that it can be software controlled, while taking advantage of hardware accelerated conversion processing to deliver the much higher level of performance. Using such an implementation, different guest architectures can be processed and converted while each receives the benefits of the hardware acceleration to enjoy a much higher level of performance. Example guest architectures include Java or JavaScript, x86, MIPS, SPARC, and the like. In one embodiment, the “guest architecture” can be native instructions (e.g., from a native application/macro-operation) and the conversion process produces optimized native instructions (e.g., optimized native instructions/micro-operations). The software controlled front end can provide a large degree of flexibility for applications executing on the processor. As described above, the hardware acceleration can achieve near native hardware speed for execution of the guest instructions of a guest application.

FIG. 6 shows a more detailed example of a hardware accelerated conversion system 600 in accordance with one embodiment of the present invention. System 600 performs in substantially the same manner as system 500 described above. However, system 600 shows additional details describing functionality of an exemplary hardware acceleration process.

The system memory 601 includes the data structures comprising the guest code 602, the conversion look aside buffer 603, optimizer code 604, converter code 605, and native code cache 606. System 600 also shows a shared hardware cache 607 where guest instructions and native instructions can both be interleaved and shared. The guest hardware cache 610 caches those guest instructions that are most frequently touched from the shared hardware cache 607.

The guest fetch logic 620 prefetches guest instructions from the guest code 602. The guest fetch logic 620 interfaces with a TLB 609 which functions as a conversion look aside buffer that translates virtual guest addresses into corresponding physical guest addresses. The TLB 609 can forward hits directly to the guest hardware cache 610. Guest instructions that are fetched by the guest fetch logic 620 are stored in the guest fetch buffer 611.

The conversion tables 612 and 613 include substitute fields and control fields and function as multilevel conversion tables for translating guest instructions received from the guest fetch buffer 611 into native instructions.

The multiplexers 614 and 615 transfer the converted native instructions to a native conversion buffer 616. The native conversion buffer 616 accumulates the converted native instructions to assemble native conversion blocks. These native conversion blocks are then transferred to the native hardware cache 608 and the mappings are kept in the conversion look aside buffer 630.

The conversion look aside buffer 630 includes the data structures for the converted blocks entry point address 631, the native address 632, the converted address range 633, the code cache and conversion look aside buffer management bits 634, and the dynamic branch bias bits 635. The guest branch address 631 and the native address 632 comprise a guest address range that indicates which corresponding native conversion blocks reside within the converted block range 633. Cache management protocols and replacement policies ensure the hot native conversion block mappings reside within the conversion look aside buffer 630 while the cold native conversion block mappings reside within the conversion look aside buffer data structure 603 in system memory 601.

As with system 500, system 600 seeks to ensure the hot block mappings reside within the high-speed low latency conversion look aside buffer 630. Thus, when the fetch logic 640 or the guest fetch logic 620 looks to fetch a guest address, in one embodiment, the fetch logic 640 can first check the guest address to determine whether the corresponding native conversion block resides within the code cache 606. This allows a determination as to whether the requested guest address has a corresponding native conversion block in the code cache 606. If the requested guest address does not reside within either the buffer 603 or 608, or the buffer 630, the guest address and a number of subsequent guest instructions are fetched from the guest code 602 and the conversion process is implemented via the conversion tables 612 and 613.

FIG. 7 shows an example of a hardware accelerated conversion system 700 having a secondary software-based accelerated conversion pipeline in accordance with one embodiment of the present invention.

The components 711-716 comprise a software implemented load store path that is instantiated within a specialized high speed memory 760. As depicted in FIG. 7, the guest fetch buffer 711, conversion tables 712-713 and native conversion buffer 716 comprise allocated portions of the specialized high speed memory 760. In many respects, the specialized high-speed memory 760 functions as a very low-level fast cache (e.g., L0 cache).

The arrow 761 illustrates the attribute whereby the conversions are accelerated via a load store path as opposed to an instruction fetch path (e.g., from the fetch decode logic).

In the FIG. 7 embodiment, the high-speed memory 760 includes special logic for doing comparisons. Because of this, the conversion acceleration can be implemented in software. For example, in another embodiment, the standard memory 760 that stores the components 711-716 is manipulated by software which uses a processor execution pipeline, where it loads values from said components 711-716 into one or more SIMD register(s) and implements a compare instruction that performs a compare between the fields in the SIMD register and, as needed, performs a mask operation and a result scan operation. A load store path can be implemented using general purpose microprocessor hardware, such as, for example, using compare instructions that compare one to many.

It should be noted that the memory 760 is accessed by instructions that have special attributes or address ranges. For example, in one embodiment, the guest fetch buffer has an ID for each guest instruction entry. The ID is created per guest instruction. This ID allows easy mapping from the guest buffer to the native conversion buffer. The ID allows an easy calculation of the guest offset to the native offset, irrespective of the different lengths of the guest instructions in comparison to the corresponding native instructions. This aspect is diagrammed in FIG. 3 above.

In one embodiment the ID is calculated by hardware using a length decoder that calculates the length of the fetched guest instruction. However, it should be noted that this functionality can be performed in hardware or software.

Once IDs have been assigned, the native instructions buffer can be accessed via the ID. The ID allows the conversion of the offset from guest offset to the native offset.
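The ID-based offset conversion described above can be illustrated with a minimal sketch. The instruction lengths below are hypothetical example data, and the assumption that the ID is simply the position of the instruction within the guest fetch buffer is made for illustration only; the source does not specify how the ID is encoded.

```python
from itertools import accumulate


def offsets(lengths):
    """Return the starting byte offset of each instruction given its length."""
    return [0] + list(accumulate(lengths))[:-1]


# Hypothetical example data: byte lengths of four guest instructions and of
# the native instruction sequence that each one converts into.
guest_lengths = [1, 3, 2, 5]
native_lengths = [4, 4, 8, 4]

# Both offset lists are indexed by the same per-instruction ID.
guest_offsets = offsets(guest_lengths)
native_offsets = offsets(native_lengths)


def guest_to_native_offset(guest_offset):
    """Map a guest offset to a native offset through the shared ID."""
    # Assumption: the ID is the position of the entry in the guest fetch buffer.
    instr_id = guest_offsets.index(guest_offset)
    return native_offsets[instr_id]
```

Because both buffers are indexed by the same ID, the lookup does not depend on the guest and native instructions having matching lengths.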

FIG. 8 shows an exemplary flow diagram illustrating the manner in which the CLB functions in conjunction with the code cache and the guest instruction to native instruction mappings stored within memory in accordance with one embodiment of the present invention.

As described above, the CLB is used to store mappings of guest addresses that have corresponding converted native addresses stored within the code cache memory (e.g., the guest to native address mappings). In one embodiment, the CLB is indexed with a portion of the guest address. The guest address is partitioned into an index, a tag, and an offset (e.g., chunk size). This guest address comprises a tag that is used to identify a match in the CLB entry that corresponds to the index. If there is a hit on the tag, the corresponding entry will store a pointer that indicates where in the code cache memory 806 the corresponding converted native instruction chunk (e.g., the corresponding block of converted native instructions) can be found.

It should be noted that the term “chunk” as used herein refers to a corresponding memory size of the converted native instruction block. For example, chunks can be different in size depending on the different sizes of the converted native instruction blocks.

With respect to the code cache memory 806, in one embodiment, the code cache is allocated in a set of fixed size chunks (e.g., with a different size for each chunk type). The code cache can be partitioned logically into sets and ways in system memory and all lower level HW caches (e.g., native hardware cache 608, shared hardware cache 607). The CLB can use the guest address to index and tag compare the way tags for the code cache chunks.

FIG. 8 depicts the CLB hardware cache 804 storing guest address tags in 2 ways, depicted as way x and way y. It should be noted that, in one embodiment, the mapping of guest addresses to native addresses using the CLB structures can be done through storing the pointers to the native code chunks (e.g., from the guest to native address mappings) in the structured ways. Each way is associated with a tag. The CLB is indexed with the guest address 802 (comprising a tag). On a hit in the CLB, the pointer corresponding to the tag is returned. This pointer is used to index the code cache memory. This is shown in FIG. 8 by the line “native address of code chunk=Seg#+F(pt)” which represents the fact that the native address of the code chunk is a function of the pointer and the segment number. In the present embodiment, the segment refers to a base for a point in memory where the pointer scope is virtually mapped (e.g., allowing the pointer array to be mapped into any region in the physical memory).
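The two-way CLB lookup and the pointer-based address calculation can be sketched as follows. The field widths, the chunk size, the segment base, and the particular form chosen for F(pt) (the pointer counts chunks from the segment base) are all assumptions made for illustration; the source only states that the native address is a function of the pointer and the segment number.

```python
INDEX_BITS = 4          # assumed: 16 CLB sets
OFFSET_BITS = 6         # assumed: offset field width within the guest address
CHUNK_SIZE = 256        # assumed native chunk size in bytes
SEGMENT_BASE = 0x10000  # assumed segment base ("Seg#")


def split_guest_address(addr):
    """Partition a guest address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Each set holds two ways (way x and way y); a way is (tag, pointer) or None.
clb = [[None, None] for _ in range(1 << INDEX_BITS)]


def clb_insert(addr, way, pointer):
    """Record a guest to native mapping in the given way of the indexed set."""
    tag, index, _ = split_guest_address(addr)
    clb[index][way] = (tag, pointer)


def clb_lookup(addr):
    """Return the native address of the code chunk on a hit, else None."""
    tag, index, _ = split_guest_address(addr)
    for entry in clb[index]:
        if entry is not None and entry[0] == tag:
            pointer = entry[1]
            # Assumed F(pt): the pointer is a chunk count from the segment base.
            return SEGMENT_BASE + pointer * CHUNK_SIZE
    return None
```

A miss (a return value of None in this sketch) corresponds to the case where higher levels of the memory hierarchy would be checked next.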

Alternatively, in one embodiment, the code cache memory can be indexed via a second method, as shown in FIG. 8 by the line “Native Address of code chunk=seg#+Index*(size of chunk)+way# *(Chunk size)”. In such an embodiment, the code cache is organized such that its way-structures match the CLB way structuring so that a 1:1 mapping exists between the ways of CLB and the ways of the code cache chunks. When there is a hit in a particular CLB way then the corresponding code chunk in the corresponding way of the code cache has the native code.

Referring still to FIG. 8, if the index of the CLB misses, the higher hierarchies of memory can be checked for a hit (e.g., L1 cache, L2 cache, and the like). If there is no hit in these higher cache levels, the addresses in the system memory 801 are checked. In one embodiment, the guest index points to an entry comprising, for example, 64 chunks. The tags of each one of the 64 chunks are read out and compared against the guest tag to determine whether there is a hit. This process is shown in FIG. 8 by the dotted box 805. If there is no hit after the comparison with the tags in system memory, there is no conversion present at any hierarchical level of memory, and the guest instruction must be converted.

It should be noted that embodiments of the present invention manage each of the hierarchical levels of memory that store the guest to native instruction mappings in a cache like manner. This comes inherently from cache-based memory (e.g., the CLB hardware cache, the native cache, L1 and L2 caches, and the like). However, the CLB also includes “code cache+CLB management bits” that are used to implement a least recently used (LRU) replacement management policy for the guest to native instruction mappings within system memory 801. In one embodiment, the CLB management bits (e.g., the LRU bits) are software managed. In this manner, all hierarchical levels of memory are used to store the most recently used, most frequently encountered guest to native instruction mappings. Correspondingly, this leads to all hierarchical levels of memory similarly storing the most frequently encountered converted native instructions.

FIG. 8 also shows dynamic branch bias bits and/or branch history bits stored in the CLB. These dynamic branch bits are used to track the behavior of branch predictions used in assembling guest instruction sequences. These bits are used to track which branch predictions are most often correctly predicted and which branch predictions are most often predicted incorrectly. The CLB also stores data for converted block ranges. This data enables the process to invalidate the converted block range in the code cache memory where the corresponding guest instructions have been modified (e.g., as in self modifying code).
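A minimal sketch of how converted block range data could be used to invalidate stale conversions is given below. The entry layout (guest entry point mapped to a half-open guest address range and a native pointer) and the example values are hypothetical; the source states only that the range data enables invalidation when guest instructions are modified.

```python
# Hypothetical CLB entries:
# guest entry point -> (range_start, range_end, native pointer),
# where the converted guest range is the half-open interval [start, end).
entries = {
    0x1000: (0x1000, 0x1040, 7),
    0x2000: (0x2000, 0x2010, 9),
}


def invalidate_on_guest_write(clb_entries, write_addr):
    """Drop every mapping whose converted guest range covers write_addr.

    Returns the list of guest entry points that were invalidated.
    """
    stale = [
        entry_point
        for entry_point, (start, end, _pointer) in clb_entries.items()
        if start <= write_addr < end
    ]
    for entry_point in stale:
        del clb_entries[entry_point]
    return stale
```

A write that falls outside every converted range leaves the mappings untouched, so only self modifying code pays the cost of reconversion.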

FIG. 9 shows an exemplary flow diagram illustrating a physical storage stack cache implementation and the guest address to native address mappings in accordance with one embodiment of the present invention. As depicted in FIG. 9, the cache can be implemented as a physical storage stack 901.

The FIG. 9 embodiment illustrates the manner in which a code cache can be implemented as a variable structure cache. Depending upon the requirements of different embodiments, the variable structure cache can be completely hardware implemented and controlled, completely software implemented and controlled, or some mixture of software initiation and control and underlying hardware enablement.

The FIG. 9 embodiment is directed towards striking an optimal balance for the task of managing the allocation and replacement of the guest to native address mappings and their corresponding translations in the actual physical storage. In the present embodiment, this is accomplished through the use of a structure that combines the pointers with variable size chunks.

A multi-way tag array is used to store pointers for different size groups of physical storage. Each time a particular storage size needs to be allocated (e.g., where the storage size corresponds to an address), then accordingly, a group of storage blocks each corresponding to that size is allocated. This allows an embodiment of the present invention to precisely allocate storage to store variable size traces of instructions. FIG. 9 shows how groups can be of different sizes. Two exemplary group sizes are shown, “replacement candidate for group size 4” and “replacement candidate for group size 2”. A pointer is stored in the TAG array (in addition to the tag that corresponds to the address) that maps the address into the physical storage address. The tags can comprise two or more sub-tags. For example, the top 3 tags in the tag structure 902 comprise sub tags A1 B1, A2 B2 C2 D2, and A3 B3 respectively as shown. Hence, tag A2 B2 C2 D2 comprises a group size 4, while tag A1 B1 comprises a group size 2. The group size mask also indicates the size of the group.

The physical storage can then be managed like a stack, such that every time there is a new group allocated, it can be placed on top of the physical storage stack. Entries are invalidated by overwriting their tag, thereby recovering the allocated space.
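The stack-style allocation of variable size groups can be sketched in a few lines. The class name, the dictionary used to stand in for the multi-way tag array, and the use of block counts as group sizes are assumptions for illustration and are not taken from FIG. 9.

```python
class StorageStack:
    """Physical storage managed as a stack of variable-size groups.

    A dictionary stands in for the multi-way tag array: each tag entry
    stores a pointer into the physical storage plus the group size.
    """

    def __init__(self):
        self.top = 0    # index of the next free storage block
        self.tags = {}  # (index, tag) -> (pointer, group_size)

    def allocate(self, index, tag, group_size):
        """Place a new group on top of the stack and record its pointer."""
        pointer = self.top
        self.top += group_size
        self.tags[(index, tag)] = (pointer, group_size)
        return pointer

    def lookup(self, index, tag):
        """Return (pointer, group_size) on a tag hit, else None."""
        return self.tags.get((index, tag))

    def invalidate(self, index, tag):
        """Invalidate an entry by removing its tag; return the freed group."""
        return self.tags.pop((index, tag), None)
```

Storage is consumed only when a group is actually allocated, which is the property that distinguishes this arrangement from a cache with fixed storage per set.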

FIG. 9 also shows an extended way tag structure 903. In some circumstances, an entry in the tag structure 902 will have a corresponding entry in the extended way tag structure 903. This depends upon whether the entry in the tag structure has an extended way bit set (e.g., set to one). For example, the extended way bit set to one indicates that there are corresponding entries in the extended way tag structure. The extended way tag structure allows the processor to extend locality of reference in a different way from the standard tag structure. Thus, although the tag structure 902 is indexed in one manner (e.g., index (j)), the extended way tag structure is indexed in a different manner (e.g., index (k)).

In a typical implementation, the index (j) can cover a much larger number of entries than the index (k). This is because, in most implementations, the primary tag structure 902 is much larger than the extended way tag structure 903, where, for example, (j) can cover 1024 entries (e.g., 10 bits) while (k) can cover 256 entries (e.g., 8 bits).

This enables embodiments of the present invention to incorporate additional ways for matching traces that have become very hot (e.g., very frequently encountered). For example, if a match within a hot set is not found in the tag structure 902, then by setting an extended way bit, the extended way tag structure can be used to store additional ways for the hot trace. It should be noted that this variable cache structure uses storage only as needed for the cached code/data that is stored on the stack. For example, if any of the cache sets (the entries indicated by the index bits) is never accessed during a particular phase of a program, then there will be no storage allocation for that set on the stack. This provides an efficient effective storage capacity increase compared to typical caches where sets have fixed physical data storage for each and every set.

There can also be bits to indicate that a set or group of sets is cold (e.g., meaning they have not been accessed in a long time). In this case the stack storage for those sets looks like bubbles within the allocated stack storage. At that time, their allocation pointers can be claimed for other hot sets. This process is a storage reclamation process, where after a chunk has been allocated within the stack, the whole set to which that chunk belongs later becomes cold. The needed mechanisms and structures (not shown in FIG. 9 in order not to clutter or obscure the aspects shown) that can facilitate this reclamation are: a cold set indicator for every set (entry index) and a reclamation process where the pointers for the ways of those cold sets are reused for the ways of other hot sets. This allows those stack storage bubbles (chunks) to be reclaimed. When not in reclamation mode, a new chunk is allocated on top of the stack. When the stack has cold sets (e.g., the set ways/chunks are not accessed in a long time), a reclamation action allows a new chunk that needs to be allocated in another set to reuse the reclaimed pointer and its associated chunk storage (that belongs to a cold set) within the stack.

It should be noted that the FIG. 9 embodiment is well-suited to use standard memory in its implementation as opposed to specialized cache memory. This attribute is due to the fact that the physical storage stack is managed by reading the pointers, reading indexes, and allocating address ranges. Specialized cache-based circuit structures are not needed in such an implementation.

It should be noted that in one embodiment, the FIG. 9 architecture can be used to implement data caches and caching schemes that do not involve conversion or code transformation. Consequently, the FIG. 9 architecture can be used to implement more standardized caches (e.g., L2 data cache, etc.). Doing so would provide a larger effective capacity in comparison to a conventional fixed structure cache, or the like.

FIG. 10 shows a diagram depicting additional exemplary details of a hardware accelerated conversion system 1000 in accordance with one embodiment of the present invention. The line 1001 illustrates the manner in which incoming guest instructions are compared against a plurality of group masks and tags. The objective is to quickly identify the type of guest instruction and assign it to a corresponding group. The group masks and tags function by matching subfields of the guest instruction in order to identify particular groups to which the guest instruction belongs. The mask obscures irrelevant bits of the guest instruction pattern to look particularly at the relevant bits. The tables, such as, for example, table 1002, store the mask-tag pairs in a prioritized manner.

A pattern is matched by reading into the table in the priority direction, which is depicted in this case being from the top down. In this manner, a pattern is matched by reading in the priority direction of the mask-tag storage. The different masks are examined in order of their priority and the pattern matching functionality is correspondingly applied in order of their priority. When a hit is found, then the corresponding mapping of the pattern is read from a corresponding table storing the mappings (e.g., table 1003). The 2nd level tables 1004 illustrate the hierarchical manner in which multiple conversion tables can be accessed in a cascading sequential manner until a full conversion of the guest instruction is achieved. As described above, the conversion tables include substitute fields and control fields and function as multilevel conversion tables for translating guest instructions received from the guest fetch buffer into native instructions.

In this manner, each byte stream in the buffer is sent to the conversion tables, where each level of conversion table serially detects bit fields. As the relevant bit fields are detected, the table substitutes the native equivalent of the field.

The table also produces a control field that helps the substitution process for this level as well as the next level table (e.g., the 2nd level table 1004). The next table uses the previous table's control field to identify the next relevant bit field, which is in turn substituted with the native equivalent. The second level table can then produce a control field to help a first level table, and so on. Once all guest bit fields are substituted with native bit fields, the instruction is fully translated and is transmitted to the native conversion buffer. The native conversion buffer is then written into the code cache and its guest to native address mappings are logged in the CLB, as described above.
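The cascading use of substitute fields and control fields can be illustrated with a small two-level sketch. The 16-bit guest encoding, the table contents, the native mnemonics, and the use of a string as the control field are all hypothetical and chosen only to keep the example readable; real conversion tables would hold encoded native bit fields.

```python
# Hypothetical 16-bit guest encoding: [opcode:8][register:8].
# Each first-level row is (mask, tag, native substitute, control field).
# Rows are listed in priority order, highest priority first.
LEVEL1 = [
    (0xFF00, 0x0100, "ADD", "reg"),
    (0xFF00, 0x0200, "LOAD", "reg"),
    (0xF000, 0x0000, "NOP", None),  # broader, lower-priority pattern
]

# Second-level tables, selected by the control field produced at level 1.
# Each row is (mask, tag, native substitute).
LEVEL2 = {
    "reg": [
        (0x00FF, 0x0001, "r1"),
        (0x00FF, 0x0002, "r2"),
    ],
}


def first_match(table, instr):
    """Return the first row, in priority order, whose mask-tag pair matches."""
    for row in table:
        mask, tag = row[0], row[1]
        if (instr ^ tag) & mask == 0:
            return row
    return None


def convert(instr):
    """Translate a guest instruction by substituting fields level by level."""
    row = first_match(LEVEL1, instr)
    if row is None:
        return None
    native = [row[2]]
    control = row[3]
    if control is not None:
        # The control field tells the next level which bit field to examine.
        sub = first_match(LEVEL2[control], instr)
        if sub is None:
            return None
        native.append(sub[2])
    return " ".join(native)
```

For instance, under these assumed tables `convert(0x0102)` yields the string `ADD r2`, with the opcode substituted at the first level and the register field at the second.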

FIG. 11A shows a diagram of an exemplary pattern matching process implemented by embodiments of the present invention. As depicted in FIG. 11A, destination is determined by the tag, the pattern, and the mask. The functionality of the pattern decoding comprises performing a bit compare (e.g., bitwise XOR), performing a bit AND (e.g., bitwise AND), and subsequently checking all zero bits (e.g., NOR of all bits).

FIG. 11B shows a diagram 1100 of a SIMD register based pattern matching process in accordance with one embodiment of the present invention. As depicted in diagram 1100, four SIMD registers 1102-1105 are shown. These registers implement the functionality of the pattern decoding process as shown. An incoming pattern 1101 is used to perform a parallel bit compare (e.g., bitwise XOR) on each of the tags, and the result performs a bit AND with the mask (e.g., bitwise AND). The match indicator results are each stored in their respective SIMD locations as shown. A scan is then performed as shown, and the first true among the SIMD elements encountered by the scan is the element where the equation (Pi XOR Ti) AND Mi = 0 for all i bits is true, where Pi is the respective pattern, Ti is the respective tag and Mi is the respective mask.
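The compare, mask, and scan steps can be modeled in software as follows. Python lists stand in for the SIMD register elements, so the per-element work here is sequential rather than parallel; the function names and example values are assumptions for illustration.

```python
def match_lanes(pattern, tags, masks):
    """Per-element match indicator: True where (P XOR T) AND M == 0."""
    return [((pattern ^ tag) & mask) == 0 for tag, mask in zip(tags, masks)]


def first_true(lanes):
    """Scan the elements in priority order; return the first hit or None."""
    for position, hit in enumerate(lanes):
        if hit:
            return position
    return None


# Hypothetical tags and masks for four elements, highest priority first.
example_tags = [0x0100, 0x0200, 0x0000, 0x0000]
example_masks = [0xFF00, 0xFF00, 0xF000, 0x0000]
```

An element with an all-zero mask matches every pattern, so placing it last in the priority order gives a default entry.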

FIG. 12 shows a diagram of a unified register file 1201 in accordance with one embodiment of the present invention. As depicted in FIG. 12, the unified register file 1201 includes 2 portions 1202-1203 and an entry selector 1205. The unified register file 1201 implements support for architecture speculation for hardware state updates.

The unified register file 1201 enables the implementation of an optimized shadow register and committed register state management process. This process supports architecture speculation for hardware state updating. Under this process, embodiments of the present invention can support shadow register functionality and committed register functionality without requiring any cross copying between register memory. For example, in one embodiment, the functionality of the unified register file 1201 is largely provided by the entry selector 1205. In the FIG. 12 embodiment, each register file entry is composed from 2 pairs of registers, R & R′, which are from portion 1 and the portion 2, respectively. At any given time, the register that is read from each entry is either R or R′, from portion 1 or portion 2. There are 4 different combinations for each entry of the register file based on the values of x & y bits stored for each entry by the entry selector 1205.

The values for the x & y bits are as follows.

00: R not Valid; R’ committed (upon read request R’ is read)
01: R speculative; R’ committed (upon read request R is read)
10: R committed; R’ speculative (upon read request R’ is read)
11: R committed; R’ not Valid (upon read request R is read)

The following is the impact of each instruction/event. Upon instruction write back, 00 becomes 01 and 11 becomes 10. Upon instruction commit, 01 becomes 11 and 10 becomes 00. Upon the occurrence of a rollback event, 01 becomes 00 and 10 becomes 11.
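The x & y bit encoding and the transitions listed above can be written down as a small state machine. The table contents are taken from the text; the event names and the choice to leave a state unchanged when an event does not list it are assumptions, since the text does not say what happens in those cases.

```python
# Which register is returned on a read for each x & y state.
READ_SELECT = {"00": "R'", "01": "R", "10": "R'", "11": "R"}

# State transitions per event, as listed in the text.
TRANSITIONS = {
    "writeback": {"00": "01", "11": "10"},
    "commit": {"01": "11", "10": "00"},
    "rollback": {"01": "00", "10": "11"},
}


def next_state(state, event):
    """Return the entry selector state after the given event.

    Assumption: a state not listed for an event is left unchanged.
    """
    return TRANSITIONS[event].get(state, state)
```

Starting from 00, a write back followed by a commit ends in 11, so reads switch from R’ to R without any register contents being copied. A write back followed by a rollback returns to 00, and reads continue to come from the committed R’.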

These changes are mainly changes to the state stored in the register file entry selector 1205 and happen based on the events as they occur. It should be noted that commit instructions and roll back events need to reach a commit stage in order to cause the bit transition in the entry selector 1205.

In this manner, execution is able to proceed within the shadow register state without destroying the committed register state. When the shadow register state is ready for committing, the register file entry selector is updated such that the valid results are read from the appropriate portion in the manner described above. In this manner, by simply updating the register file entry selector as needed, speculative execution results can be rolled back to the most recent commit point in the event of an exception. Similarly, the commit point can be advanced forward, thereby committing the speculative execution results, by simply updating the register file entry selectors. This functionality is provided without requiring any cross copying between register memory.

In this manner, the unified register file can implement a plurality of speculative scratch shadow registers (SSSR) and a plurality of committed registers (CR) via the register file entry selector 1205. For example, on a commit, the SSSR registers become CR registers. On roll back SSSR state is rolled back to the CR registers.

FIG. 13 shows a diagram of a unified shadow register file and pipeline architecture 1300 that supports speculative architectural states and transient architectural states in accordance with one embodiment of the present invention.

The FIG. 13 embodiment depicts the components comprising the architecture 1300 that supports instructions and results comprising architecture speculation states and supports instructions and results comprising transient states. As used herein, a committed architecture state comprises visible registers and visible memory that can be accessed (e.g., read and write) by programs executing on the processor. In contrast, a speculative architecture state comprises registers and/or memory that is not committed and therefore is not globally visible.

In one embodiment, there are four usage models that are enabled by the architecture 1300. A first usage model includes architecture speculation for hardware state updates, as described above in the discussion of FIG. 12.

A second usage model includes dual scope usage. This usage model applies to the fetching of 2 threads into the processor, where one thread executes in a speculative state and the other thread executes in the non-speculative state. In this usage model, both scopes are fetched into the machine and are present in the machine at the same time.

A third usage model includes the JIT (just-in-time) translation or compilation of instructions from one form to another. In this usage model, the reordering of architectural states is accomplished via software, for example, the JIT. The third usage model can apply to, for example, guest to native instruction translation, virtual machine to native instruction translation, or remapping/translating native micro instructions into more optimized native micro instructions.

A fourth usage model includes transient context switching without the need to save and restore a prior context upon returning from the transient context. This usage model applies to context switches that may occur for a number of reasons. One such reason could be, for example, the precise handling of exceptions via an exception handling context. The second, third, and fourth usage models are further described in the discussions of FIGS. 14-17 below.

Referring again to FIG. 13, the architecture 1300 includes a number of components for implementing the 4 usage models described above. The unified shadow register file 1301 includes a first portion, committed register file 1302, a second portion, the shadow register file 1303, and a third portion, the latest indicator array 1304. A speculative retirement memory buffer 1342 and a latest indicator array 1340 are included. The architecture 1300 comprises an out of order architecture, hence, the architecture 1300 further includes a reorder buffer and retirement window 1332. The reorder buffer and retirement window 1332 further includes a machine retirement pointer 1331, a ready bit array 1334 and a per instruction latest indicator, such as indicator 1333.

The first usage model, architecture speculation for hardware state updates, is further described in detail in accordance with one embodiment of the present invention. As described above, the architecture 1300 comprises an out of order architecture. The hardware of the architecture 1300 is able to commit out of order instruction results (e.g., out of order loads and out of order stores and out of order register updates). The architecture 1300 utilizes the unified shadow register file in the manner described in the discussion of FIG. 12 above to support speculative execution between committed registers and shadow registers. Additionally, the architecture 1300 utilizes the speculative load store buffer 1320 and the speculative retirement memory buffer 1342 to support speculative execution.

The architecture 1300 will use these components in conjunction with the reorder buffer and retirement window 1332 to allow its state to retire correctly to the committed register file 1302 and to the visible memory 1350 even though the machine retired those in an out of order manner internally to the unified shadow register file and the retirement memory buffer. For example, the architecture will use the unified shadow register file 1301 and the speculative retirement memory buffer 1342 to implement rollback and commit events based upon whether exceptions occur or do not occur. This functionality enables the register state to retire out of order to the unified shadow register file 1301 and enables the speculative retirement memory buffer 1342 to retire out of order to the visible memory 1350. As speculative execution proceeds and out of order instruction execution proceeds, if no branch has been mispredicted and there are no exceptions that occur, the machine retirement pointer 1331 advances until a commit event is triggered. The commit event causes the unified shadow register file to commit its contents by advancing its commit point and causes the speculative retirement memory buffer to commit its contents to the memory 1350 in accordance with the machine retirement pointer 1331.

For example, considering the instructions 1-7 that are shown within the reorder buffer and retirement window 1332, the ready bit array 1334 shows an “X” beside instructions that are ready to execute and a “I” beside instructions that are not ready to execute. Accordingly, instructions 1, 2, 4, and 6 are allowed to proceed out of order. Subsequently, if an exception occurs, such as the instruction 6 branch being mispredicted, the instructions that occur subsequent to instruction 6 can be rolled back. Alternatively, if no exception occurs, all of the instructions 1-7 can be committed by moving the machine retirement pointer 1331 accordingly.

The latest indicator array 1341, the latest indicator array 1304 and the latest indicator 1333 are used to allow out of order execution. For example, even though instruction 2 loads register R4 before instruction 5, the load from instruction 2 will be ignored once the instruction 5 is ready to occur. The latest load will override the earlier load in accordance with the latest indicator.

In the event of a branch misprediction or exception occurring within the reorder buffer and retirement window 1332, a rollback event is triggered. As described above, in the event of a rollback, the unified shadow register file 1301 will roll back to its last committed point and the speculative retirement memory buffer 1342 will be flushed.
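The commit and rollback behavior can be modeled with a short behavioral sketch. The class and method names are hypothetical. Note also that this model copies shadow values into the committed set on a commit purely for simplicity, whereas the unified register file described above avoids such cross copying by updating the entry selector bits.

```python
class SpeculativeState:
    """Behavioral model of commit and rollback (names are hypothetical)."""

    def __init__(self):
        self.committed_regs = {}  # stands in for the committed registers
        self.shadow_regs = {}     # stands in for the shadow registers
        self.memory = {}          # stands in for the visible memory
        self.smb = []             # speculative retirement memory buffer

    def write_reg(self, name, value):
        """Speculative register writes go to the shadow state."""
        self.shadow_regs[name] = value

    def read_reg(self, name):
        """Reads return the latest write, speculative or committed."""
        if name in self.shadow_regs:
            return self.shadow_regs[name]
        return self.committed_regs.get(name)

    def store(self, addr, value):
        """Speculative stores are held in the buffer, not in memory."""
        self.smb.append((addr, value))

    def commit(self):
        """Advance the commit point and drain the buffer to memory."""
        # Simplification: a real unified register file flips selector bits
        # here instead of copying register contents.
        self.committed_regs.update(self.shadow_regs)
        self.shadow_regs.clear()
        for addr, value in self.smb:
            self.memory[addr] = value
        self.smb.clear()

    def rollback(self):
        """Discard speculative registers and flush the buffer."""
        self.shadow_regs.clear()
        self.smb.clear()
```

After a rollback, reads fall through to the last committed values and none of the buffered stores reach the visible memory.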

FIG. 14 shows a diagram 1400 of the second usage model, including dual scope usage in accordance with one embodiment of the present invention. As described above, this usage model applies to the fetching of 2 threads into the processor, where one thread executes in a speculative state and the other thread executes in the non-speculative state. In this usage model, both scopes are fetched into the machine and are present in the machine at the same time.

As shown in diagram 1400, 2 scope/traces 1401 and 1402 have been fetched into the machine. In this example, the scope/trace 1401 is a current non-speculative scope/trace. The scope/trace 1402 is a new speculative scope/trace. Architecture 1300 enables a speculative and scratch state that allows 2 threads to use those states for execution. One thread (e.g., 1401) executes in a non-speculative scope and the other thread (e.g., 1402) uses the speculative scope. Both scopes can be fetched into the machine and be present at the same time, with each scope setting its respective mode differently. The first is non-speculative and the other is speculative. So the first executes in CR/CM mode and the other executes in SR/SM mode. In the CR/CM mode, committed registers are read and written to, and memory writes go to memory. In the SR/SM mode, register writes go to SSSR, and register reads come from the latest write, while memory writes go to the retirement memory buffer (SMB).

One example will be a current scope that is ordered (e.g., 1401) and a next scope that is speculative (e.g., 1402). Both can be executed in the machine as dependencies will be honored because the next scope is fetched after the current scope. For example, in scope 1401, at the “commit SSSR to CR”, registers and memory up to this point are in CR mode while the code executes in CR/CM mode. In scope 1402, the code executes in SR and SM mode and can be rolled back if an exception happens. In this manner, both scopes execute at the same time in the machine but each is executing in a different mode and reading and writing registers accordingly.

FIG. 15 shows a diagram 1500 of the third usage model, including transient context switching without the need to save and restore a prior context upon returning from the transient context in accordance with one embodiment of the present invention. As described above, this usage model applies to context switches that may occur for a number of reasons. One such reason could be, for example, the precise handling of exceptions via an exception handling context.

The 3rd usage model occurs when the machine is executing translated code and it encounters a context switch (e.g., an exception inside of the translated code or if translation for subsequent code is needed). In the current scope (e.g., prior to the exception), SSSR and the SMB have not yet committed their speculative state to the guest architecture state. The current state is running in SR/SM mode. When the exception occurs the machine switches to an exception handler (e.g., a convertor) to take care of the exception precisely. A rollback is inserted, which causes the register state to roll back to CR and the SMB is flushed. The convertor code will run in SR/CM mode. During execution of convertor code the SMB is retiring its content to memory without waiting for a commit event. The registers are written to SSSR without updating CR. Subsequently, when the convertor is finished and before switching back to executing converted code, it rolls back the SSSR (e.g., SSSR is rolled back to CR). During this process the last committed register state is in CR.

This is shown in diagram 1500 where the previous scope/trace 1501 has committed from SSSR into CR. The current scope/trace 1502 is speculative. Registers and memory in this scope are speculative and execution occurs under SR/SM mode. In this example, an exception occurs in the scope 1502 and the code needs to be re-executed in the original order before translation. At this point, SSSR is rolled back and the SMB is flushed. Then the JIT code 1503 executes. The JIT code rolls back SSSR to the end of scope 1501 and flushes the SMB. Execution of the JIT is under SR/CM mode. When the JIT is finished, the SSSR is rolled back to CR and the current scope/trace 1504 then re-executes in the original translation order in CR/CM mode. In this manner, the exception is handled precisely at the exact current order.
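The sequence of modes in this usage model can be traced with a small software model. It is only a sketch under assumptions: the `Machine` class, its method names, and the register and address values are hypothetical, and each mode string is read as "register mode/memory mode" (SR or CR for registers, SM or CM for memory).

```python
class Machine:
    """Toy model of the register/memory modes named in the text."""

    def __init__(self, committed):
        self.cr = dict(committed)  # committed registers (CR)
        self.sssr = {}             # speculative register writes (SSSR)
        self.memory = {}           # memory after retirement
        self.smb = []              # speculative memory buffer (SMB)
        self.mode = "CR/CM"

    def write_reg(self, reg, value):
        # SR: write the speculative copy; CR: write committed state directly.
        target = self.sssr if self.mode.startswith("SR") else self.cr
        target[reg] = value

    def read_reg(self, reg):
        if self.mode.startswith("SR") and reg in self.sssr:
            return self.sssr[reg]
        return self.cr[reg]

    def store(self, addr, value):
        # SM: buffer the store in the SMB; CM: the store goes to memory.
        if self.mode.endswith("SM"):
            self.smb.append((addr, value))
        else:
            self.memory[addr] = value

    def rollback(self):
        # Discard speculative registers and flush the SMB.
        self.sssr.clear()
        self.smb.clear()


m = Machine({"R1": 5})       # state committed by the previous scope

m.mode = "SR/SM"             # current scope runs speculatively
m.write_reg("R1", 6)
m.store(0x10, 111)

m.rollback()                 # exception: roll back SSSR, flush SMB
m.mode = "SR/CM"             # handler (JIT/convertor) runs as transient context
m.write_reg("R2", 99)        # scratch register, never reaches CR
m.store(0x20, 222)           # retires to memory without waiting for a commit
m.rollback()                 # handler done: SSSR rolled back to CR

m.mode = "CR/CM"             # re-execute the scope non-speculatively
m.write_reg("R1", m.read_reg("R1") + 1)
m.store(0x10, 111)
```

In this model the handler's scratch register never appears in the committed registers, so no separate save and restore of the prior context is needed.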

FIG. 16 shows a diagram 1600 depicting a case where the exception in the instruction sequence is because translation for subsequent code is needed in accordance with one embodiment of the present invention. As shown in diagram 1600, the previous scope/trace 1601 concludes with a far jump to a destination that is not translated. Before jumping to a far jump destination, SSSR is committed to CR. The JIT code 1602 then executes to translate the guest instructions at the far jump destination (e.g., to build a new trace of native instructions). Execution of the JIT is under SR/CM mode. At the conclusion of JIT execution, the register state is rolled back from SSSR to CR, and the new scope/trace 1603 that was translated by the JIT begins execution. The new scope/trace continues execution from the last committed point of the previous scope/trace 1601 in the SR/SM mode.
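The translate-on-miss flow at a far jump can be sketched as below. This is an illustrative software analogy only: `translate`, `TraceCache`, the guest address 0x4000 and the string form of the "native" instructions are all hypothetical, and the sketch assumes that a newly built trace is kept so that a later jump to the same destination does not invoke the JIT again.

```python
def translate(guest_code, addr):
    """Hypothetical stand-in for the JIT: build a 'native trace' for a guest block."""
    return [f"native<{insn}>" for insn in guest_code[addr]]


class TraceCache:
    """Maps guest far-jump destinations to already translated native traces."""

    def __init__(self, guest_code):
        self.guest_code = guest_code
        self.traces = {}   # guest destination address -> native trace
        self.jit_runs = 0  # how many times translation was needed

    def far_jump(self, dest):
        trace = self.traces.get(dest)
        if trace is None:
            # Destination not translated yet: run the JIT, then keep the trace.
            self.jit_runs += 1
            trace = translate(self.guest_code, dest)
            self.traces[dest] = trace
        return trace


guest_code = {0x4000: ["add", "load", "jmp"]}
cache = TraceCache(guest_code)
first = cache.far_jump(0x4000)   # miss: the JIT builds a new trace
second = cache.far_jump(0x4000)  # hit: the existing trace is reused
```

The first jump triggers translation and the second reuses the stored trace.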

FIG. 17 shows a diagram 1700 of the fourth usage model, including transient context switching without the need to save and restore a prior context upon returning from the transient context in accordance with one embodiment of the present invention. As described above, this usage model applies to context switches that may occur for a number of reasons. One such reason could be, for example, the processing of inputs or outputs via an exception handling context.

Diagram 1700 shows a case where a previous scope/trace 1701 executing under CR/CM mode ends with a call of function F1. Register state up to that point is committed from SSSR to CR. The function F1 scope/trace 1702 then begins executing speculatively under SR/CM mode. The function F1 then ends with a return to the main scope/trace 1703. At this point, the register state is rolled back from SSSR to CR. The main scope/trace 1703 resumes executing in the CR/CM mode.

FIG. 18 shows a diagram of an exemplary microprocessor pipeline 1800 in accordance with one embodiment of the present invention. The microprocessor pipeline 1800 includes a hardware conversion accelerator 1810 that implements the functionality of the hardware acceleration conversion process, as described above. In the FIG. 18 embodiment, the hardware conversion accelerator 1810 is coupled to a fetch module 1801 which is followed by a decode module 1802, an allocation module 1803, a dispatch module 1804, an execution module 1805 and a retirement module 1806. It should be noted that the microprocessor pipeline 1800 is just one example of the pipeline that implements the functionality of embodiments of the present invention described above. One skilled in the art would recognize that other microprocessor pipelines can be implemented that include the functionality of the decode module described above.

The foregoing description, for the purpose of explanation, has been described with reference to specific embodiments. However, the illustrated discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. Embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as may be suited to the particular use contemplated.

What is claimed is:
 1. A hardware based translation accelerator, comprising: a guest fetch logic component for accessing a plurality of guest instructions; a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling the plurality of guest instructions into a guest instruction block; a plurality of conversion tables coupled to the guest fetch buffer for translating the guest instruction block into a corresponding native conversion block; a native cache coupled to the conversion tables for storing the corresponding native conversion block; a conversion look aside buffer coupled to the native cache for storing a mapping of a guest far branch in the guest instruction block to a corresponding native instruction; wherein upon a subsequent request for the guest far branch, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest far branch has a corresponding converted native instruction in the native cache; and in response to the hit the conversion look aside buffer forwards the translated native instruction for execution.
 2. The hardware based translation accelerator of claim 1, wherein a hardware fetch logic component prefetches the plurality of guest instructions independent of the processor.
 3. The hardware based translation accelerator of claim 1, wherein the conversion look aside buffer comprises a cache that uses a replacement policy to maintain most frequently encountered mappings stored therein.
 4. The hardware based translation accelerator of claim 1, wherein a conversion buffer is maintained within a system memory and cache coherency is maintained between the conversion look aside buffer and the conversion buffer.
 5. The hardware based translation accelerator of claim 4, wherein the conversion buffer is larger than the conversion look aside buffer, and a write back policy is used to maintain coherency between the conversion buffer and the conversion look aside buffer.
 6. The hardware based translation accelerator of claim 1, wherein the conversion look aside buffer is implemented as a high-speed low latency cache memory coupled to a pipeline of the processor.
 7. A system for accelerating the translation of guest instructions to native instructions for a processor, comprising: a guest fetch logic component for accessing a plurality of guest instructions; a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling the plurality of guest instructions into a guest instruction block; a plurality of conversion tables coupled to the guest fetch buffer for translating the guest instruction block into a corresponding native conversion block; a native cache coupled to the conversion tables for storing the corresponding native conversion block; a conversion look aside buffer coupled to the native cache for storing a mapping of a guest far branch in the guest instruction block to a corresponding native instruction; wherein upon a subsequent request for the guest far branch, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest far branch has a corresponding converted native instruction in the native cache; and in response to the hit the conversion look aside buffer forwards the translated native instruction for execution.
 8. The system of claim 7, wherein a hardware fetch logic component prefetches the plurality of guest instructions independent of the processor.
 9. The system of claim 7, wherein the conversion look aside buffer comprises a cache that uses a replacement policy to maintain most frequently encountered mappings stored therein.
 10. The system of claim 7, wherein a conversion buffer is maintained within a system memory and cache coherency is maintained between the conversion look aside buffer and the conversion buffer.
 11. The system of claim 10, wherein the conversion buffer is larger than the conversion look aside buffer, and a write back policy is used to maintain coherency between the conversion buffer and the conversion look aside buffer.
 12. The system of claim 7, wherein the conversion look aside buffer is implemented as a high-speed low latency cache memory coupled to a pipeline of the processor.
 13. A microprocessor that implements a method of translating instructions, said microprocessor comprises: a microprocessor pipeline; a hardware accelerator module coupled to the microprocessor pipeline, wherein the hardware accelerator module further comprises: a guest fetch logic component for accessing a plurality of guest instructions; a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling the plurality of guest instructions into a guest instruction block; a plurality of conversion tables coupled to the guest fetch buffer for translating the guest instruction block into a corresponding first native conversion block; a native cache coupled to the conversion tables for storing the first corresponding native conversion block; a conversion look aside buffer coupled to the native cache for storing a mapping of a guest far branch in the guest instruction block to a corresponding native instruction, wherein the corresponding native instruction is a first instruction in a second native conversion block; wherein upon a subsequent request for the guest far branch, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest far branch has the second native conversion block in the native cache; and in response to the hit the conversion look aside buffer forwards the second native conversion block for execution.
 14. The microprocessor of claim 13, wherein a hardware fetch logic component prefetches the plurality of guest instructions independent of the processor.
 15. The microprocessor of claim 13, wherein the conversion look aside buffer comprises a cache that uses a replacement policy to maintain most frequently encountered mappings stored therein.
 16. The microprocessor of claim 13, wherein a conversion buffer is maintained within a system memory and cache coherency is maintained between the conversion look aside buffer and the conversion buffer.
 17. The microprocessor of claim 16, wherein the conversion buffer is larger than the conversion look aside buffer, and a write back policy is used to maintain coherency between the conversion buffer and the conversion look aside buffer.
 18. The microprocessor of claim 13, wherein the conversion look aside buffer is implemented as a high-speed low latency cache memory coupled to a pipeline of the processor.
 19. The microprocessor of claim 13, wherein the hardware accelerator module functions comprises a parallel guest instruction fetch pipeline that functions in parallel to a native microprocessor fetch pipeline.
 20. A microprocessor that implements a method of translating instructions, said microprocessor comprises: a microprocessor pipeline; an accelerator module comprising high speed memory coupled to the microprocessor pipeline, wherein the accelerator module further comprises: a guest fetch logic for accessing a plurality of guest instructions; a guest fetch memory coupled to the guest fetch logic for assembling the plurality of guest instructions into a guest instruction block; a plurality of conversion tables for translating the guest instruction block into a corresponding native conversion block, wherein the guest instruction block comprises a guest far branch instruction; a native conversion buffer for storing the corresponding native conversion block; wherein upon a subsequent request for the guest far branch, a conversion look aside buffer storing a mapping of the guest far branch in the guest instruction block to a corresponding native instruction is indexed to determine whether a hit occurred, wherein the mapping indicates the guest far branch has a corresponding converted native instruction in the native cache; and in response to the hit the conversion look aside buffer forwards the translated native instruction for execution.
 21. The microprocessor of claim 20, wherein the high speed memory comprises an L0 cache of the microprocessor.
 22. The microprocessor of claim 20, wherein the accelerator module further comprises a load store instruction fetch path of the microprocessor.